Abstract In this paper, simultaneous reduction of circuit depth and synthesis cost of reversible circuits in
quantum technologies with limited interaction is addressed. We developed a cycle-based synthesis algorithm
which uses negative controls and limited distance between gate lines. To improve circuit depth, a new
parallel structure is introduced in which before synthesis a set of disjoint cycles are extracted from the input
specification and distributed into some subsets. The cycles of each subset are synthesized independently on
different sets of ancillae. Accordingly, each disjoint set can be synthesized by different synthesis methods.
Our analysis shows that the best worst-case synthesis cost of reversible circuits in the linear nearest neighbor
architecture is improved by the proposed approach. Our experimental results reveal the effectiveness of the
proposed approach to reduce cost and circuit depth for several benchmarks.

1 Introduction



Boolean reversible circuits have attracted attention as components in several quantum algorithms including
Shor's quantum factoring [1] and stabilizer circuits [2]. In recent years, considerable efforts have been
made to synthesize a Boolean reversible function by a set of quantum gates [3].

The proposed technologies for quantum computing suffer from practical limitations for implementation.
For example, popular quantum technologies allow computation on a few qubits in a linear nearest neighbor
(LNN) architecture where only adjacent qubits can interact [3]. Additionally, physical qubits are fragile and
can hold their states only for a limited time, called coherence time [5]. To reflect technological constraints
in the synthesis stage, different technology-specific cost metrics have been introduced.



- Two-qubit cost is the number of two-qubit gates of any type and the number of one-qubit gates (reported
separately) in a given circuit. The number of two-qubit gates for an n-qubit Toffoli gate (for n > 3)
is estimated as $10n - 25$ [6]. Quantum cost (QC) is the number of NOT, CNOT, controlled-V and
controlled-V† gates required to implement a given reversible function.

- Interaction cost is the distance between gate qubits for any two-qubit gate. Quantum circuit technologies
with 1D, 2D and 3D interactions exist [3]. Interaction cost for a circuit is calculated by a summation
over the interaction costs of its gates.

- Number of ancillae and garbage qubits reflect the limited number of qubits in the current quantum
technologies.

- Depth is the largest number of elementary gates on any path from inputs to outputs in a circuit. Reducing
circuit depth helps a computation finish within the coherence time.
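
The metrics above can be made concrete with a small sketch. The gate encoding (a tuple of control lines plus a target line), the function names and the example circuit are assumptions made for illustration only; in particular, the sketch counts the distance of adjacent lines as 0, which is one possible reading of the interaction cost definition.

```python
from typing import List, Tuple

# Hypothetical encoding: a gate is (controls, target); all entries are line indices.
Gate = Tuple[Tuple[int, ...], int]


def lines_of(gate: Gate) -> List[int]:
    controls, target = gate
    return list(controls) + [target]


def two_qubit_count(circuit: List[Gate]) -> int:
    """Number of gates that act on exactly two lines."""
    return sum(1 for gate in circuit if len(lines_of(gate)) == 2)


def interaction_cost(circuit: List[Gate]) -> int:
    """Sum, over two-qubit gates, of the number of lines between control and target.

    Assumption: adjacent lines have distance 0.
    """
    total = 0
    for gate in circuit:
        lines = lines_of(gate)
        if len(lines) == 2:
            total += abs(lines[0] - lines[1]) - 1
    return total


def depth(circuit: List[Gate]) -> int:
    """Schedule every gate as early as possible and return the number of time steps."""
    busy_until = {}  # line index -> last time step in which the line is used
    for gate in circuit:
        lines = lines_of(gate)
        step = 1 + max(busy_until.get(line, 0) for line in lines)
        for line in lines:
            busy_until[line] = step
    return max(busy_until.values(), default=0)


# Four CNOT gates on four lines; the last one spans non-adjacent lines.
example = [((0,), 1), ((2,), 3), ((1,), 2), ((0,), 3)]
```

For `example`, the first two gates share no line and can run in the same time step, as can the last two, so the sketch reports a depth of 2 while the interaction cost comes only from the gate between lines 0 and 3.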

Synthesis of reversible Boolean circuits has an exponential search space. Consequently, many heuristic
algorithms have been proposed to consider the effects of quantum cost and two-qubit cost in the synthesis
stage [7-10]. Additionally, several post-process optimization methods have been developed to improve
quantum cost [8], [11], [6], interaction cost [12], [13], and depth [11]. However, the number of algorithms which
consider different parameters simultaneously, which is the focus of this work, is very limited.

Besides technological limitations, studying theoretical aspects of circuits with either limited interactions
among qubits of gates or limited depth attracts interest in complexity theory. For example, $\mathrm{NC}^k$ is the
class of decision problems solvable by a uniform family of Boolean circuits with polynomial size, depth of
$O(\log^k n)$ and fan-in 2. $\mathrm{QNC}^0$ is the class of constant-depth quantum circuits without fanout gates [15].

In this paper, a synthesis algorithm for Boolean reversible circuits is proposed which uses a cycle-based 
strategy to synthesize circuits for the LNN architecture. The proposed technique leads to improved synthesis 
costs as compared to the best prior methods for several benchmarks. Moreover, a parallel structure for 
reversible Boolean circuits is presented which significantly reduces circuit depth with 2n ancillae. Overall, 
our circuits can be considered as depth-optimized reversible circuits for the LNN architecture. 

This paper is organized as follows. Basic concepts are introduced in Section 2. Related synthesis and
post-process optimization methods are reviewed in Section 3. The proposed cycle-based synthesis algorithm
for the LNN architecture is described in Section 4. Section 5 presents a parallel structure to reduce circuit
depth. Experimental results are reported in Section 6, and Section 7 concludes the paper.



2 Basic Concepts 

In this section, preliminary concepts are briefly introduced. Further background can be found in 

Permutation Function. Let B be any set and define $f: B \to B$ as a one-to-one and onto transition
function. The function f is a permutation function, as applying f to B leads to a set with the same elements
of B and probably in a different order. If $B = \{1, 2, 3, \ldots, m\}$, there exist two elements $b_i$ and $b_j$ belonging
to B such that $f(b_i) = b_j$. A k-cycle with length k is denoted as $(b_1, b_2, \ldots, b_k)$ which means that $f(b_1) =
b_2$, $f(b_2) = b_3$, ..., and $f(b_k) = b_1$. A given k-cycle $(b_1, b_2, \ldots, b_k)$ could be written in different ways, such as
$(b_2, b_3, \ldots, b_k, b_1)$. Cycles $c_1$ and $c_2$ are called disjoint if they have no common members. Any permutation can
be written uniquely, except for the order, as a product of disjoint cycles. If two cycles $c_1$ and $c_2$ are disjoint,
they can commute, i.e., $c_1 c_2 = c_2 c_1$. A cycle with length two is called a transposition. A cycle or a permutation
is called even (odd) if it can be written as an even (odd) number of transpositions. When a k-cycle is even
(odd) then k is odd (even).
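
As a small illustration of these definitions, the following sketch extracts the disjoint cycles of a permutation and determines its parity. The list-based representation (`perm[x]` is the image of `x`, with elements numbered from 0) and the function names are assumptions made for this example.

```python
def disjoint_cycles(perm):
    """Return the disjoint cycles (length >= 2) of a permutation given as a list."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        if len(cycle) > 1:  # fixed points are omitted
            cycles.append(tuple(cycle))
    return cycles


def is_even(perm):
    """A k-cycle is a product of k - 1 transpositions, so parities add up."""
    transpositions = sum(len(cycle) - 1 for cycle in disjoint_cycles(perm))
    return transpositions % 2 == 0


# 0 -> 2 -> 1 -> 0 is a 3-cycle (even); 3 <-> 4 is a transposition (odd).
example_perm = [2, 0, 1, 4, 3]
```

Here `example_perm` decomposes into one 3-cycle and one 2-cycle, i.e., three transpositions in total, so the permutation is odd.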

Reversible Function. An n-input, n-output, fully specified Boolean function $f: B^n \to B^n$ over variables
$X = \{x_0, \ldots, x_{n-1}\}$ is called reversible if it maps each input pattern to a unique output pattern. Each
reversible function can be considered as a permutation function. The added lines to a circuit are called
ancillae and typically start out with 0 or 1.

Reversible Gate. An n-input, n-output gate is reversible if it realizes a reversible function. A multiple-control
Toffoli gate can be written as $C^m\mathrm{NOT}(C; t)$, where $C = \{i_1, \ldots, i_m\}$ is the set of control lines, $t = \{j\}$
with $C \cap t = \emptyset$ is the target line and $0 \le i, j \le n - 1$. A control line may be positive (negative), which means
that its value must be one (zero) for the target to be inverted. For m=0 and m=1, the gates are called NOT
(N) and CNOT (C), respectively. For m=2, the gate is called $C^2\mathrm{NOT}$ or Toffoli (T). The SWAP(a,b) gate
exchanges the values of two qubits a and b, and can be constructed by three CNOT gates C(a,b)C(b,a)C(a,b).
The controlled-V (controlled-V†) gate changes the value of its target line using the transformation given by
the matrix V (V†) if the control line has the value 1.
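
The classical behavior of these gates can be checked with a short simulation. The sketch below stores a circuit state as an integer whose bit i is the value of line i; this encoding and the function names are illustrative assumptions, and only Boolean basis states are modeled (the controlled-V gates are not covered).

```python
def apply_toffoli(state, pos_controls, neg_controls, target):
    """Invert `target` if all positive controls are 1 and all negative controls are 0."""
    positives_ok = all((state >> c) & 1 for c in pos_controls)
    negatives_ok = not any((state >> c) & 1 for c in neg_controls)
    if positives_ok and negatives_ok:
        state ^= 1 << target
    return state


def cnot(state, control, target):
    return apply_toffoli(state, [control], [], target)


def swap_by_cnots(state, a, b):
    """SWAP(a, b) built as C(a,b) C(b,a) C(a,b)."""
    state = cnot(state, a, b)
    state = cnot(state, b, a)
    state = cnot(state, a, b)
    return state
```

Because a gate only ever inverts its target, applying it to all $2^n$ basis states yields a permutation of those states, which is the sense in which each gate realizes a reversible function.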
3 Related Work

In this section, we review prior synthesis and optimization techniques that are used in this paper. 

In [16], an NCT-based synthesis method is proposed which decomposes a given cycle into a set of
transpositions. To implement an arbitrary pair of transpositions (a, b)(c, d) for distinct $a, b, c, d \neq 0, 2^i$, the
authors introduced three subcircuits, namely $\pi$, $\kappa_0$ and $\pi^{-1}$ (the inverse of $\pi$), where the $\kappa_0$ circuit,
$C^{n-2}\mathrm{NOT}(a_2, \ldots, a_{n-1}; a_0)$, implements a fixed pair of transpositions $(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$.
Accordingly, a synthesis algorithm was proposed to transform a, b, c and d to $2^n - 4$, $2^n - 3$, $2^n - 2$ and $2^n - 1$,
respectively. By cascading $\pi$, $\kappa_0$ and $\pi^{-1}$, an arbitrary pair of transpositions can be implemented with quantum
cost $34n - 64$.

The NCT-based synthesis method in [16] was extensively improved in [10], called the k-cycle method hereafter. In
the k-cycle method, a given cycle of length $\ge 6$ is decomposed into a set of cycles of lengths $< 6$, called
elementary cycles. Next, a set of synthesis algorithms was proposed to synthesize different elementary cycles,
Fig. 1 (a) 3-input reversible full adder with optimal depth 4 [14]. (b) The circuit in (a) after inserting SWAP gates and (c)
reducing the number of SWAP gates by [12].



i.e., a pair of 2-cycles, a single 3-cycle, a pair of 3-cycles, a single 5-cycle, a pair of 5-cycles, a single 2-cycle
(4-cycle) followed by a single 4-cycle (2-cycle) and a pair of 4-cycles. Similar to [16], 0 and $2^i$ terms are fixed
before synthesis because their effect on the synthesis results is negligible [10]. NCT gates with positive
controls are used in both [16] and [10]. The effect of decomposition on the result of [10] was considered in
[17] where a cycle-assignment technique based on graph matching was proposed. The worst-case quantum
cost for synthesizing an arbitrary reversible function on n lines is $8.5n2^n + o(2^n)$ in [10].

In [13], the authors introduced a post-process optimization algorithm to reduce the depth of a given
quantum circuit. To achieve this, a set of circuit templates (circuit identities) was proposed to reduce
quantum cost and circuit depth. The suggested templates are applied to change either gate locations or
control/target positions in a subcircuit to parallelize more gates. The introduced templates were used by a
greedy algorithm which starts from gate i and traverses the gates afterwards. At each step, the algorithm
moves gates to the left whenever possible and applies templates to check whether other gates can be moved to
the left. If no change is possible, it starts the same process from gate i + 1.

In [12], a synthesis flow was proposed to improve the interaction cost of a given quantum circuit. The 
authors studied the exact synthesis of some small gates for the LNN architecture. The proposed optimal 
circuits are used to simplify larger circuits. Besides, some circuit templates are introduced to reduce the 
number of SWAP gates. Finally, local and global reordering of input qubits are considered to reorder gate 
qubits for improving the interaction cost. The proposed techniques were consolidated in a unified design 
flow to implement a given circuit with arbitrary interactions for architectures with limited interactions. 

Fig. 1-a shows a 3-input full adder with depth 4 [11] and six elementary gates. Actually, depth 4 is optimal
since four gates are involved in the fourth qubit [14]. Fig. 1-b shows the same circuit after inserting SWAP
gates to make the gate qubits adjacent with QC=24 and depth=23. Fig. 1-c illustrates the same circuit
after applying the method in [12] for reducing the number of SWAP gates where QC=18 and depth=17.



4 The Proposed Cycle-Based Synthesis Method for Interaction Cost 

The cycle-based synthesis approach of [10] considers quantum cost as the sole metric. In contrast, the
cycle-based method proposed in this section also considers another important implementation constraint,
namely interaction cost.
To do that, we improve the k-cycle method by using negative controls and adapting the synthesis algorithms
of elementary cycles to the LNN architecture. Particularly, two new elementary odd cycles, a 2-cycle and
a 4-cycle, are included to improve quantum cost. These odd cycles are synthesized as a pair of 2-cycles
and a pair of 4-cycles in [10] with one ancilla. Odd cycles need one ancilla in the NCT library for the
implementation [16]. In our experiments, we used this ancilla for the decomposition of complex gates into
elementary gates. Additionally, 0 and $2^i$ terms are not fixed before synthesis to be used in the proposed
parallel structure as discussed in Section 5.

Negative controls can reduce the number of elementary gates in the $\kappa_0$, $\pi$ and $\pi^{-1}$ circuits both with
and without considering the nearest neighbor restriction. Multiple-control Toffoli gates with at least one positive
control can be simulated as efficiently as complex Toffoli gates with only positive controls [11]. By using
Fig. 2 (a) The $\kappa_0(2,2)$ circuit in [16], [10]. (b) The proposed $\kappa_0(2,2)$ circuit. Each control at position i, $0 < i < n - k + 2$,
is negative. (c) An example of the $\pi$ circuit in [10]. ag is used to control CNOTs in the first part. The second subcircuit is the
circuit in [10, Theorem 3.1]. (d) An example of the $\pi$ circuit in the proposed method. Here, k=3. Refer to Table 1.



Algorithm 1: Gate selection in the $\pi$ circuit
Input:
    L n-bit input terms. Bit value at position j of the i-th input term is $b^{in}_{(i,j)}$.
    L n-bit $\kappa_0$ terms. Bit value at position j of the i-th $\kappa_0$ term is $b^{\kappa}_{(i,j)}$.
    Pivot is the boldfaced position in the intermediate terms in Table 1.
Output: The $\pi$ circuit.
for i in 0 to L - 1 do
    if $b_{(i,Pivot)} \neq 1$ then
        Set $b_{(i,Pivot)} = 1$ by either a CNOT or a Toffoli gate;
    end
    for j in 0 to Pivot do
        if $b^{in}_{(i,j)} \neq b^{\kappa}_{(i,j)}$ then
            Find a position p: $b_{(i,p)} = 1$, $b_{(k,p)} \neq 1$ ($k < i$), $|p - j|$ is the minimum possible value, and $p \le Pivot$;
            Apply CNOT(p; j);
        end
    end
    for j in n-1 to Pivot+1 do
        if $b^{in}_{(i,j)} \neq b^{\kappa}_{(i,j)}$ then
            Find a position p: $b_{(i,p)} = 1$, $b_{(k,p)} \neq 1$ ($k < i$), $|p - j|$ is the minimum possible value, and $p \ge Pivot$;
            Apply CNOT(p; j);
        end
    end
end



CNOT and Toffoli gates with negative controls, one need not fix 0 and $2^i$ terms before synthesis, in contrast
to the methods in [16], [10].

Cycle Construction Length (CCL) is defined as the number of lines required to implement a given
cycle of length L. In theory, the minimum CCL is $\lceil \log_2 L \rceil$. To implement the elementary cycles by NCT
gates, at most two more lines are required in the proposed approach: one to avoid Toffoli gates without
any positive control in the $\kappa_0$ circuit, and one to improve circuit cost in the $\pi$, $\pi^{-1}$ circuits. Accordingly, we
set CCL(2)=2, CCL(2,2)=4, CCL(3)=3, CCL(3,3)=5, CCL(4)=4, CCL(4,2)=5, CCL(4,4)=5, CCL(5)=5 and
CCL(5,5)=6. For an n-line circuit, lines required to construct a given cycle, CCL in total, can be selected in
$n \times (n - 1) \times \cdots \times (n - \mathrm{CCL} + 1)$ different ways. To improve interaction cost and depth we place the selected
lines close to each other in the middle of the $\kappa_0$ circuit at positions $k$, $k \pm 1$ and $k \pm 2$ for $k = \lfloor n/2 \rfloor$. Details
are discussed later.

To synthesize a given elementary cycle, one needs to change input terms into the terms specified by the
$\kappa_0$ circuit. This is done by converting the input terms into intermediate terms specified by the $\pi$ circuit.
Afterwards, the intermediate terms are transformed into $\kappa_0$ terms by a few specific gates, called static gates.
In the proposed method, the control and target lines in the $\pi$ circuit are selected such that interaction cost
can be reduced. Since $\kappa_0$ cycles are constructed in the middle of the circuit and the intermediate terms are
designed with at least one "1", as boldfaced in column Int. Terms in Table 1, it is possible to select control
and target lines of each gate with length $\le \lceil (n - \mathrm{CCL})/2 + \mathrm{CCL} \rceil$. Considering two SWAP gates with cost 6
leads to $QC_{LNN} \le 3(n + \mathrm{CCL})$ for each gate. To reduce circuit depth, the gates required to fix bit positions
at the first half and the second half are applied in parallel. Algorithm 1 provides the details.
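
The per-gate bound can be checked numerically with a small sketch. It assumes a simple cost model that is not spelled out in the text: a CNOT whose control and target are `distance` lines apart is made adjacent by `distance - 1` SWAP gates, the same number of SWAP gates restores the line order afterwards, and each SWAP costs three CNOT gates. The function names are hypothetical.

```python
import math

SWAP_COST = 3  # one SWAP gate built from three CNOT gates


def max_gate_length(n: int, ccl: int) -> int:
    """Largest control-target distance stated above for an n-line circuit."""
    return math.ceil((n - ccl) / 2 + ccl)


def lnn_gate_cost(distance: int) -> int:
    """Cost of one CNOT in the LNN architecture under the assumed model."""
    swaps = 2 * (distance - 1)  # move next to the target, then move back
    return swaps * SWAP_COST + 1


def per_gate_bound(n: int, ccl: int) -> int:
    return 3 * (n + ccl)
```

Under this model each unit of distance beyond adjacency costs two SWAP gates, i.e., 6 elementary gates, which is where the factor 3(n + CCL) comes from once the distance is bounded by roughly (n + CCL)/2.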
The $\kappa_0(2,2)$ circuits in [16], [10] and the proposed $\kappa_0(2,2)$ circuit are shown in Fig. 2-a and Fig. 2-b,
respectively. Fig. 2-c illustrates one example of the $\pi$ circuit in [10]. The input term is "11110111" which
should be changed to the second term in the $\kappa_0(2,2)$ circuit in [10], i.e., "11111101". This is done by a circuit
with QC=16 and depth=11. In contrast, "11110111" should be changed to "00100100" in the proposed
method. Fig. 2-d shows the $\pi$ circuit with QC=5 and depth=3 based on Algorithm 1.

4.1 Building Blocks 

In this section, direct synthesis of the suggested elementary cycles, i.e., (2), (2,2), (3), (3,3), (4,2), (4,4),
(5), (5,5), is discussed. Fig. 3 illustrates the $\kappa_0$ circuits of all elementary cycles. We give a full description
of the synthesis method for a pair of 2-cycles first.

(2,2)-synthesis: To change (a,b)(c,d) to $\kappa_0(2,2)$ terms:

- At most n NOT gates can be used to convert a to "0...1000...0". Other terms b, c, and d may be changed
to new terms $b'$, $c'$ and $d'$, respectively.

- At most one CNOT gate conditioned on either the i-th line with $i \neq k + 2$ (positive) or $i = k + 2$ (negative)
can be used to set the $(k - 1)$-th bit of $b'$. Next, at most $n - 1$ CNOT gates conditioned on the $(k - 1)$-th
bit can be applied to change the j-th bit of $b'$ ($0 \le j \le n - 1$, $j \neq k - 1$) to "0...1001...0". $c'$ and $d'$ may
be changed to new terms $c''$ and $d''$.

- At most one CNOT gate conditioned on either the i-th line with $i \neq k + 2$ (positive) or $i = k + 2$ (negative)
can be used to set the k-th bit of $c''$. Next, at most $n - 1$ CNOT gates with positive control conditioned
on the k-th bit can be applied to change the j-th bit of $c''$ ($0 \le j \le n - 1$, $j \neq k$) to "0...1010...0". The
last term $d''$ may be changed to a new term $d'''$.

- At most one CNOT gate conditioned on either the i-th line with $i \neq k + 2$ (positive) or $i = k + 2$ (negative) can
be used to set the $(k + 1)$-th bit of $d'''$. Next, at most $n - 1$ CNOT gates with positive control conditioned
on the $(k + 1)$-th bit can be applied to change the j-th bit of $d'''$ ($0 \le j \le n - 1$, $j \neq k + 2$) to "0...1111...0".

- A Toffoli gate conditioned on the $(k - 1)$-th and the k-th lines can be used to set the $(k + 1)$-th line.
Therefore, it changes "0...1111...0" to "0...1011...0".

Note that converting each term does not corrupt the previously fixed terms. The same number of gates are
needed for the $\pi^{-1}$ circuit. Accordingly, a total number of $8n + 22$ elementary gates are required for the $\pi$
and $\pi^{-1}$ circuits. The circuit in Fig. 3-b implements $(2^{k+2}, 2^{k+2} + 2^{k-1})(2^{k+2} + 2^k, 2^{k+2} + 2^k + 2^{k-1})$ with
cost $24n - 88$. Therefore, an arbitrary pair of 2-cycles (a, b)(c, d) can be implemented by at most $32n - 66$
elementary gates.

Following the above discussion for the (2,2)-synthesis method, details for the synthesis of other elementary
cycles are given in Table 1. In this table, subscripts in column Input Cycle(s) denote orders in
considering each term. Intermediate terms are represented by binary expansions with LSB on the right and
the underlined bit in the k-th position ($k = \lfloor n/2 \rfloor$). The boldfaced "1" is Pivot in Algorithm 1 for each term.
The parenthesized pairs in column Max. Cost represent CNOT count with negative and positive controls,
respectively. The numbers given in column Terms for the $\kappa_0$ circuit are bit positions with value "1" in binary
representation. Table 2 reports the resulting quantum cost of each elementary cycle. As can be seen, the
total number of elementary gates is improved by a linear factor in most cases. Considering the worst-case
cost of $3(n + \mathrm{CCL})$ for each gate in the $\pi$ and $\pi^{-1}$ circuits in the LNN architecture and $6n - 12$ elementary
gates (i.e., two chains of $n - 2$ SWAP gates) for the $\kappa_0$ circuits leads to the results given in Total Cost (LNN)
column in Table 2.

4.2 Worst-Case Analysis 

In this section, an upper bound on the number of gates in the proposed cycle-based method is calculated. To
achieve this, let all terms of a truth table be involved in the input cycles to have a cycle with the maximum
length $2^n$ for an n-input/n-output function. To convert a cycle with length > 5 to a set of elementary cycles,
we may have some repeated terms in non-disjoint cycles. As such, $2^n + a_r$ shows the maximum number of
terms, where $a_r$ is the maximum number of repeated terms and can be estimated as $a_r = 2^{n-2} + O(n)$.
Theorem 1 discusses the maximum number of elementary gates in our approach.

Theorem 1 The maximum number of elementary gates for any permutation in the proposed approach is $9.4n2^n -
18.8 \cdot 2^n + o(n^3)$ and $42.4n^2 2^n + o(n^4)$ without and with considering interaction cost, respectively.
Table 1 Direct synthesis of elementary cycles. Subscripts in the input cycles denote the orders in considering each term.
The underlined bit is in the k-th position ($k = \lfloor n/2 \rfloor$). The boldfaced "1" is Pivot in Algorithm 1. Numbers given for $\kappa_0$ terms
are bit positions with "1" in the binary expansion. Within each row, entries separated by semicolons are listed term by term.

| Input Cycle(s) | Int. Terms | $\pi$ or $\pi^{-1}$ Circuit: Max. Cost | $\pi$ or $\pi^{-1}$ Circuit: Static Gates | $\kappa_0$ Circuit: Terms |
|---|---|---|---|---|
| $(a_1, b_2)$ | 0...10...0; 0...11...0 | n N; n(1,n-1) C | | (k+1); (k-1)(k+1) |
| $(a_1, b_2)(c_3, d_4)$ | 0...1000...0; 0...1001...0; 0...1010...0; 0...1111...0 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C | T(k-1, k; k+1) | (k+2); (k+2)(k-1); (k+2)(k); (k+2)(k)(k-1) |
| $(a_1, b_2, c_3)$ | 0...001...0; 0...101...0; 0...111...0 | n N; n(1,n-1) C; n(1,n-1) C | | (k-1); (k+1)(k-1); (k+1)(k)(k-1) |
| $(a_1, b_2, c_3)(d_4, e_5, f_6)$ | 0...00001...0; 0...00011...0; 0...00111...0; 0...10001...0; 0...11011...0; 0...11111...0 | n N; n(1,n-1) C; n(1,n-1) C; 1 T, n-1 C; 1 T, n-1 C; 1 T, n-1 C | T(k-1, k+2; k+1); T(k, k+2; k+1) | (k-2); (k-1)(k-2); (k)(k-1)(k-2); (k+2)(k-2); (k+2)(k-1)(k-2); (k+2)(k)(k-1)(k-2) |
| $(a_1, b_2, c_3, d_4)$ | 0...1000...0; 0...1001...0; 0...1010...0; 0...1111...0 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C | T(k-1, k; k+1) | (k+2); (k-1)(k+2); (k)(k+2); (k-1)(k)(k+2) |
| $(a_1, b_2, c_3, d_4)(e_5, f_6)$ | 0...10000...0; 0...10010...0; 0...10100...0; 0...11110...0; 0...10111...0; 0...11011...0 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; 1 T, n-1 C | T(k-1, k; k+1); T(k-2, k'; k+1) | (k+2); (k+2)(k-1); (k+2)(k); (k+2)(k)(k-1); (k+2)(k)(k-1)(k-2); (k+2)(k-1)(k-2) |
| $(a_1, b_2, c_3, d_4)(e_5, f_6, g_7, h_8)$ | 0...10000...0; 0...10001...0; 0...10010...0; 0...10111...0; 0...10100...0; 0...11101...0; 0...11110...0; 0...11111...0 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; 1 T, n-1 C; 1 T, n-1 C; n(1,n-1) C | T(k-2, k-1; k); T(k-2, k; k+1); T(k-1, k; k+1); T(k-2, k-1, k; k+1) | (k+2); (k+2)(k-2); (k+2)(k-1); (k+2)(k-1)(k-2); (k+2)(k); (k+2)(k)(k-2); (k+2)(k)(k-1); (k+2)(k)(k-1)(k-2) |
| $(a_1, b_4, c_2, d_3, e_5)$ | 0...10000...0; 0...10001...0; 0...10010...0; 0...11011...0; 0...10111...0 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C | T(k-2, k-1; k+1) | (k+2); (k+2)(k-2); (k+2)(k-1); (k+2)(k-1)(k-2); (k+2)(k)(k-1)(k-2) |
| $(a_1, b_4, c_2, d_3, e_{10})(f_5, g_8, h_6, i_7, j_9)$ | 0...100000; 0...100001; 0...100010; 0...100111; 0...101000; 0...111001; 0...111010; 0...111011; 0...101111; 0...110111 | n N; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; n(1,n-1) C; 1 T, n-1 C; 1 T, n-1 C; n(1,n-1) C; n(1,n-1) C; 1 T, n-1 C | T(k-3, k-2; k-1); T(k-3, k; k+1); T(k-2, k; k+1); T(k-2, k-1, k; k+1); T(k-1, k'; k+1) | (k+2); (k+2)(k-3); (k+2)(k-2); (k+2)(k-2)(k-3); (k+2)(k); (k+2)(k)(k-3); (k+2)(k)(k-2); (k+2)(k)(k-2)(k-3); (k+2)(k)(k-1)(k-2)(k-3); (k+2)(k-1)(k-2)(k-3) |

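Table 2 Worst-case costs for elementary cycles. Columns $\kappa_0$ through Total Cost (LNN) refer to the proposed method;
the last two columns refer to [10].

| EC | Length | $\kappa_0$ | $\pi$, $\pi^{-1}$ | Total Cost | Cost/Length | Total Cost (LNN) | Total Cost [10] | Cost/Length [10] |
|---|---|---|---|---|---|---|---|---|
| (2) | 2 | $24n-64$ | $2n+2$ | $28n-60$ | $14n-30$ | $145n^2-666n+772$ | $34n-30$ | $17n-15$ |
| (2,2) | 4 | $24n-88$ | $4n+11$ | $32n-66$ | $8n-16.5$ | $147n^2-791n+1100$ | $34n-64$ | $8.5n-16$ |
| (3) | 3 | $24n-88$ | $3n+4$ | $30n-80$ | $10n-26.7$ | $146n^2-804n+1068$ | $32n-82$ | $10.7n-27.3$ |
| (3,3) | 6 | $24n-112$ | $6n+26$ | $36n-60$ | $6n-10$ | $149n^2-907n+1474$ | $38n-46$ | $6.3n-15.3$ |
| (4) | 4 | $48n-152$ | $4n+11$ | $56n-130$ | $14n-32.5$ | $291n^2-1463n+1868$ | $50n-84$ | $12.5n-21$ |
| (4,2) | 6 | $36n-204$ | $6n+14$ | $48n-176$ | $8n-29.4$ | $221n^2-1615n+2483$ | $50n-122$ | $8.3n-20.3$ |
| (4,4) | 8 | $36n-204$ | $8n+46$ | $52n-112$ | $6.5n-14$ | $223n^2-1573n+2678$ | $56n-126$ | $7n-15.7$ |
| (5) | 5 | $48n-166$ | $5n+13$ | $58n-140$ | $11.6n-28$ | $292n^2-1537n+2057$ | $60n-130$ | $12n-26$ |
| (5,5) | 10 | $36n-204$ | $10n+57$ | $56n-90$ | $5.6n-9$ | $225n^2-319n+2790$ | $64n-54$ | $6.4n-5.4$ |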
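
The entries of Table 2 for the proposed method can be cross-checked with a few lines of code. The sketch below transcribes the $\kappa_0$, $\pi$ and Total Cost columns as pairs (a, b) standing for $an + b$ and tests the relation suggested by the (2,2) discussion above, namely that the total cost is the $\kappa_0$ cost plus twice the $\pi$ cost (once for $\pi$ and once for $\pi^{-1}$). The dictionary layout and function names are choices made for this example.

```python
# Each polynomial a*n + b is stored as (a, b); values transcribed from Table 2.
TABLE2 = {
    # cycle: (length, kappa_0 cost, pi cost, total cost)
    "(2)":   (2,  (24, -64),  (2, 2),   (28, -60)),
    "(2,2)": (4,  (24, -88),  (4, 11),  (32, -66)),
    "(3)":   (3,  (24, -88),  (3, 4),   (30, -80)),
    "(3,3)": (6,  (24, -112), (6, 26),  (36, -60)),
    "(4)":   (4,  (48, -152), (4, 11),  (56, -130)),
    "(4,2)": (6,  (36, -204), (6, 14),  (48, -176)),
    "(4,4)": (8,  (36, -204), (8, 46),  (52, -112)),
    "(5)":   (5,  (48, -166), (5, 13),  (58, -140)),
    "(5,5)": (10, (36, -204), (10, 57), (56, -90)),
}


def total_from_parts(kappa, pi):
    """kappa_0 cost plus the cost of pi and of its inverse."""
    return (kappa[0] + 2 * pi[0], kappa[1] + 2 * pi[1])


def cost_per_length(total, length):
    return (total[0] / length, total[1] / length)


def evaluate(poly, n):
    return poly[0] * n + poly[1]
```

For instance, `evaluate(TABLE2["(2,2)"][3], 8)` gives the worst-case number of elementary gates for a pair of 2-cycles on 8 lines under the bounds listed in the table.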
Fig. 3 The $\kappa_0$ circuit structures for different elementary cycles. The circuit structures for cycles (2,2), (3), (3,3), (4,2),
(4,4), (5), and (5,5) are similar to those proposed in [10]. The new circuits for (2) and (4) besides the application of negative
controls and the revised terms in the $\kappa_0$ circuits improve quantum cost and interaction cost.



Proof In Table 2, the column Cost/Length determines a cost needed for setting a term in each elementary
cycle. To calculate the maximum cost, suppose at most one 3-cycle, one 4-cycle and one 5-cycle are included
which can be synthesized by the related synthesis algorithms. All other terms are supposed to be synthesized
as pairs of 2-cycles. Note that the number of elementary gates for fixing terms in a pair of 2-cycles is greater
than any other pairs (See Table 2). The repeated terms in non-disjoint 5-cycles are synthesized by the
(5,5)-cycle synthesis method.

Accordingly we will have $3 \times \mathrm{Cost/Length}_{3} + 4 \times \mathrm{Cost/Length}_{4} + 5 \times \mathrm{Cost/Length}_{5} + (2^n - 12) \times \mathrm{Cost/Length}_{2,2}
+ a_r \times \mathrm{Cost/Length}_{5,5}$, which leads to $9.4n2^n - 18.8 \times 2^n + 2.8n^2 + 43.5n - 152.1$ elementary gates in the worst
case with arbitrary interaction and $42.4n^2 2^n + 11.3n^3 + 288.2n^2$ with limited interaction.

The worst-case quantum cost of [10] is $51n^2 2^n$ for architectures with limited interaction.
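
The leading coefficients of the bound can be reproduced from the Cost/Length entries of Table 2. The sketch below is a numerical check only; it assumes that the number of repeated terms grows as $a_r \approx 2^{n-2}$, i.e., one quarter of $2^n$, and it ignores all lower-order terms. The constant names are hypothetical.

```python
# Cost/Length entries from Table 2, stored as (coefficient of n, constant).
PAIR_OF_2_CYCLES = (8.0, -16.5)
PAIR_OF_5_CYCLES = (5.6, -9.0)

# Leading n^2 coefficients of Total Cost (LNN) divided by the cycle length.
PAIR_OF_2_CYCLES_LNN = 147 / 4
PAIR_OF_5_CYCLES_LNN = 225 / 10

# Assumption: repeated terms a_r ~ 2^(n-2) = 0.25 * 2^n.
REPEATED_FRACTION = 0.25


def leading_coefficients():
    """Coefficients of n*2^n and 2^n (arbitrary interaction) and of n^2*2^n (LNN)."""
    n_term = PAIR_OF_2_CYCLES[0] + REPEATED_FRACTION * PAIR_OF_5_CYCLES[0]
    const_term = PAIR_OF_2_CYCLES[1] + REPEATED_FRACTION * PAIR_OF_5_CYCLES[1]
    lnn_term = PAIR_OF_2_CYCLES_LNN + REPEATED_FRACTION * PAIR_OF_5_CYCLES_LNN
    return n_term, const_term, lnn_term
```

The three values come out as 9.4, -18.75 and 42.375, which agree with the rounded coefficients 9.4, 18.8 and 42.4 used in Theorem 1.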



5 Synthesis with Parallel Structure 

In this section, a parallel circuit structure is introduced for reversible logic that can be used to considerably 
reduce circuit depth of reversible circuits in most cases. The general idea is to copy input lines into k sets 
of zero-initialized ancillae, divide the input specification into k sets of disjoint cycles and then synthesize 
each set independently by using the prepared ancillae. The final results can be recovered by several CNOTs. 
It should be mentioned that adding ancillae has been previously used for quantum cost reduction in the
synthesis and optimization methods [9], [6]. In the proposed method, ancillae are used for the purpose of
depth reduction without considerable overhead on quantum cost, thanks to the specific form of input
representation, i.e., cycle. Note that each cycle can be synthesized by a different synthesis method.
Fig. 4 (a) The input storing block with linear depth. (b) An alternative circuit structure with improved interaction cost
and linear depth. (c) A logarithmic-depth circuit structure.



Input Storing Block. Copying an arbitrary quantum state is not possible in general but a Boolean value
can be copied into a zero-initialized ancilla by a CNOT gate conditioned on the main line and targeted
on the ancilla. For m n-line zero-initialized ancillae, the input storing block includes mn CNOT gates with
depth m. Fig. 4-a shows the input storing block for a circuit with n main lines and m n-line ancillae.
The interaction cost can be calculated as $n(n - 1)(1 + 2 + \cdots + (m - 1)) = (1/2)nm(n - 1)(m - 1)$. Fig. 4-b
illustrates another circuit structure with improved interaction cost, $mn(n - 1)$. Circuit depth in Fig. 4-a can
be improved from linear factor to logarithmic factor $O(\log m)$ [15] as shown in Fig. 4-c. Thus, interaction
cost can be calculated as $(1/2)n(n - 1)(m^2 - 2)$.

Output Restoring Block. Since each subcircuit implements a set of disjoint cycles, for a given input
combination, only one circuit (active) produces the results and the outputs of other subcircuits (inactive)
are the same as the inputs. The number of inactive subcircuits is equal to the number of n-line ancillae
registers, which is even. As such, XORing (by CNOT) the outputs of all subcircuits on the main lines cancels
inputs and restores correct outputs at the main lines. Overall, for m n-line ancillae and m+1 sets of disjoint
cycles, mn CNOTs with depth m are sufficient. Fig. 5-a illustrates the output restoring block for m n-line
ancillae with interaction cost $nm(n - 1)(m - 1)$. A CNOT circuit with a common target can be implemented
with logarithmic depth [13] as illustrated in Fig. 5-b for n=4 and m=4. In this case, interaction cost is
$(1/2)nm(n - 1)(2m + \log_2 m + 2)$.

Theorem 2 Consider a given specification F on n lines written as a set of disjoint cycles $C_1 C_2 \cdots C_m$ for an odd
m. Assume that subcircuit $L_i$ implements $C_i$. The specification F can be implemented with depth $O(\mathrm{depth}_{max}(L_i))$
in the presence of $m - 1$ n-line ancillae.

Proof Copying the input lines to $m - 1$ n-line zero-initialized ancillae replicates inputs at the ancillae.
Disjoint cycles commute. Hence, each subcircuit can be implemented on one register independently. The
input storing/output restoring blocks have constant depth m. Therefore, circuit depth is dominated by the
maximum depth of all subcircuits.
Fig. 5 (a) The output restoring block with linear depth. (b) The output restoring block with logarithmic depth for four
main lines and four 4-line ancillae.

Fig. 6 An example of the proposed parallel cycle-based structure for a 4-line function.



A given specification may contain a set of disjoint cycles with exponential lengths, i.e., $O(2^n)$. In such
cases, circuit depth cannot be further improved by Theorem 2. However, as will be shown in Section 6,
circuit depth can be reduced considerably even with a small number of n-line ancillae. To efficiently employ
the result of Theorem 2, one needs to determine disjoint cycle sets.

Example 1 Assume that the input cycles (1,3)(7,10)(0,4)(6,15)(2,8)(5,13) are given for a circuit with 4 lines.
All cycles are elementary and no decomposition is required. Let two 4-line ancillae be available and each pair of
2-cycles be assigned to one set, i.e., (1,3)(7,10) to set #1, (0,4)(6,15) to set #2 and (2,8)(5,13) to set #3.
Applying the input storing block provides the input data on the added zero-initialized ancillae. Now, the proposed
method in Section 4 can be applied for each cycle pair which leads to three subcircuits. To combine the results, one
needs to add the output restoring block. Accordingly, total depth is equal to the maximum depth of the synthesized
subcircuits (i.e., 33) plus 4 (2 for each input storing/output restoring block). Fig. 6 illustrates the result.
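
The functional behavior of the parallel structure in Example 1 can be simulated at the level of permutations. The sketch below does not synthesize any gates; it models each register as an integer, each subcircuit as a lookup table, and the output restoring block as a bitwise XOR of all register contents. The function names are assumptions made for this illustration.

```python
from functools import reduce


def perm_from_cycles(cycles, size):
    """Build the lookup table of a permutation from a list of disjoint cycles."""
    perm = list(range(size))
    for cycle in cycles:
        for pos, x in enumerate(cycle):
            perm[x] = cycle[(pos + 1) % len(cycle)]
    return perm


def parallel_eval(x, subperms):
    """Evaluate an odd number of subcircuits on copies of x and XOR the results."""
    registers = [x] * len(subperms)                    # input storing block
    outputs = [p[r] for p, r in zip(subperms, registers)]  # independent subcircuits
    return reduce(lambda a, b: a ^ b, outputs)         # output restoring block


# The three cycle sets of Example 1 on a 4-line circuit (16 input patterns).
cycle_sets = [[(1, 3), (7, 10)], [(0, 4), (6, 15)], [(2, 8), (5, 13)]]
subperms = [perm_from_cycles(cycles, 16) for cycles in cycle_sets]
full_perm = perm_from_cycles([c for cycles in cycle_sets for c in cycles], 16)
```

For every input, at most one subcircuit changes the value, so the remaining even number of unchanged copies cancel under XOR and `parallel_eval(x, subperms)` agrees with `full_perm[x]`.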

Cycle Distribution. Consider n elementary cycles and m register sets, including the input register. The
problem is to assign disjoint cycles into different registers such that the total depth of the circuit in each
register is minimized and the depths of the registers are almost equal. To achieve this goal, we modeled the
cycle distribution problem as the bin packing problem$^1$ with a few exceptions. In our modeling, registers
are bins and cycles are objects. Each cycle is decomposed into a set of elementary cycles and cost values in
Table 2 are used as the weights of elementary cycles. If the input permutation is odd, the permutation in
one bin should be odd. Many heuristic algorithms have been developed to solve different variants of the bin
packing problem. Examples include first fit and best fit algorithms.
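
A simple greedy heuristic in this spirit can be sketched as follows. It is not necessarily the procedure used in the experiments: it places the heaviest cycle first and always chooses the currently lightest register, it ignores the parity constraint mentioned above, and the weights in the example are made-up numbers rather than values from Table 2.

```python
def distribute(weights, num_bins):
    """Assign cycle indices to bins, heaviest first, always into the lightest bin."""
    bins = [[] for _ in range(num_bins)]
    loads = [0] * num_bins
    # sorted() is stable, so cycles of equal weight keep their original order.
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        lightest = min(range(num_bins), key=lambda j: loads[j])
        bins[lightest].append(idx)
        loads[lightest] += weights[idx]
    return bins, loads


# Hypothetical synthesis costs of six elementary cycles, distributed over three registers.
example_weights = [62, 62, 40, 40, 16, 16]
example_bins, example_loads = distribute(example_weights, 3)
```

With these weights the three registers end up with loads 78, 78 and 80, so the deepest register exceeds the others by a small margin only.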

^1 The bin packing problem is a combinatorial NP-hard problem in computational complexity theory in which objects of
different weights must be packed into a finite number of bins of capacity W such that the number of used bins is
minimized. Given a bin of size W and a list $w_1, \ldots, w_n$ of sizes of the items, one should find an integer B and a
B-partition $S_1 \cup \ldots \cup S_B$ of $\{1, \ldots, n\}$ such that $\sum_{i \in S_k} w_i \le W$ for all $k = 1, \ldots, B$.
A solution is optimal if it has minimal B.

Table 3 Benchmark specifications before and after decomposition.



            # of cycles
            Before DCM   After DCM            After DCM & DIST   Depth
Benchmark      EC  nEC   (2)  (3)  (4)  (5)   set1  set2  set3   circ1  circ2  circ3  total
hwb8           48   16    36    -   28   16     26    28    26    1923   1995   1953   1999
hwb9           54   38     -   54    -   76     43    43    44    4344   4347   3988   4351
hwb10         228   26   152    -    -  154    101   103   102    9929  10058   9898  10062
hwb11           -  186     -  186    -  372    186   186   186   23862  23826  23827  23866
nth_prime7      -    5     1    3    1   28     19     6     8    1519    390    734   1523
nth_prime8      -    1     -    1    -   62     63     -     -    5852      -      -   5852
nth_prime9      1    3     2    -    1  125    122     4     2   13783    393    346  13787
nth_prime10     1    3     1    2    -  253     96    85    75   12115  10947   9329  12119
nth_prime11     -    6     2    -    1  507    315    36   159   46888   5470  23765  46892



To solve the problem, a best fit algorithm is developed which sorts the c elementary cycles according to their
maximum synthesis costs and processes one cycle at a time. To distribute cycles, the first cycle is selected
and temporarily assigned to bin i for 1 ≤ i ≤ m. Then, the total cost is calculated among all the bins and
the cycle is permanently assigned to the bin which results in the lowest total cost. In the case of a tie, the
bins are selected in sequence. The algorithm continues until all the cycles are assigned. Therefore, the total
time complexity is O(c log c) + O(cm^2). At the end, the algorithm checks the permutation of each bin to
make sure that at most one bin has an odd permutation. Odd permutations need one ancilla in the NCT
library [16]. If more than one bin is found with an odd permutation (called an odd bin), the algorithm moves
the smallest odd cycle of the odd bin with the maximum depth to the odd bin with the minimum depth. This
can take O(m) time. After the changes, the involved bins have even permutations. This process is
continued until at most one bin with an odd permutation exists; this occurs when the input permutation
is odd and at least m even permutations exist to fill all the bins. Altogether, the whole process has a time
complexity of O(c log c) + O(cm^2).
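
The distribution and parity-repair steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the "total cost among all the bins" is interpreted here as the largest bin load after a trial placement, ties go to the lowest-numbered bin, and the cycle length (`len`) is used as a hypothetical stand-in for the cost values of Table 2. The function names are likewise illustrative.

```python
def is_odd(cycle):
    # A k-cycle is a product of k - 1 transpositions, so even-length cycles are odd.
    return len(cycle) % 2 == 0


def bin_is_odd(cycles):
    # A bin holds an odd permutation when it contains an odd number of odd cycles.
    return sum(is_odd(c) for c in cycles) % 2 == 1


def distribute(cycles, m, cost):
    """Greedy best-fit sketch: each cycle goes to the bin giving the lowest peak load."""
    bins = [[] for _ in range(m)]
    loads = [0] * m
    for cyc in sorted(cycles, key=cost, reverse=True):      # O(c log c) sort
        w = cost(cyc)
        best_bin, best_peak = 0, None
        for i in range(m):                                  # m trial placements ...
            peak = max(loads[j] + (w if j == i else 0) for j in range(m))  # ... O(m) each
            if best_peak is None or peak < best_peak:       # strict '<': first bin wins ties
                best_bin, best_peak = i, peak
        bins[best_bin].append(cyc)
        loads[best_bin] += w
    return bins, loads


def fix_parity(bins, loads, cost):
    """Move odd cycles between odd bins until at most one odd bin remains."""
    while True:
        odd = [i for i in range(len(bins)) if bin_is_odd(bins[i])]
        if len(odd) <= 1:
            return bins, loads
        src = max(odd, key=lambda i: loads[i])                          # deepest odd bin
        dst = min((i for i in odd if i != src), key=lambda i: loads[i])  # shallowest other odd bin
        cyc = min((c for c in bins[src] if is_odd(c)), key=cost)        # smallest odd cycle
        bins[src].remove(cyc)
        bins[dst].append(cyc)
        loads[src] -= cost(cyc)
        loads[dst] += cost(cyc)


# The six 2-cycles of Example 1 spread over three registers (hypothetical weight: len).
cycles = [(1, 3), (7, 10), (0, 4), (6, 15), (2, 8), (5, 13)]
bins, loads = fix_parity(*distribute(cycles, 3, cost=len), cost=len)
print(loads)
```

Each parity-repair move turns two odd bins into even ones, so the loop ends after at most m/2 moves. The particular pairing of cycles produced by this sketch need not match the pairing chosen in Example 1.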

6 Experimental Results 

The proposed cycle-based synthesis method for the LNN architecture and the suggested parallel structure
for reversible logic synthesis were implemented in C++ and all of the experiments were performed on
an Intel Pentium IV 2.5GHz computer with 4GB memory. To evaluate the proposed synthesis method,
some of the reversible benchmark functions from [18] were synthesized. The selection criteria for these
benchmarks will be discussed later in this section and their specifications are given in Table 3 before and
after decomposition. The decomposition approach of [10] is used in our method to decompose the input
cycles into the proposed elementary cycles. The numbers of elementary cycles (EC) and non-elementary
cycles (nEC) of each benchmark are reported in this table. After decomposition, all cycles are elementary
with length < 6. Note that [10] proposes the best prior synthesis algorithm for medium-size hwbN and N-th
prime functions if no ancilla is available [18]. While hwbN functions can be implemented with a polynomial
cost O(n log^2 n) if a logarithmic number of garbage bits ⌈log n⌉ + 1 is available [18], the proposed approach
is more general and can be applied to many reversible functions.

To evaluate the proposed parallel structure, the proposed cycle-based algorithm was used for synthesizing
each subset. Since the number of signals is limited in the current quantum technologies, the minimum
number of ancillae (2 n-line registers) was used. Therefore, the number of input cycles should be ≥ 3 to
have at least one cycle in each subset. In our experiments, the results of [13] were used for decomposing
multiple-control Toffoli gates and calculating quantum cost for the gates with negative controls. Besides,
the two-qubit cost model of [6] is used for evaluating the results. A naive SWAP insertion method and the
method of [12] were used to evaluate the results for the LNN architecture. For the naive method, move
and delete rules were applied on the synthesized circuits to remove redundant gates. To estimate circuit
depth, the greedy level compaction algorithm of [11] was implemented without applying the templates.
Table 4 and Table 5 report the quantum cost (QC), the two-qubit cost (2-qubit) and the depth (Depth)
for the synthesized circuits without and with limited interaction. Since [10] does not target the limited
interaction in the LNN architecture, we used the method of [12] on the results of [10] and ours to insert
SWAP gates. The runtime of [10] and our method is less than one minute for the selected benchmarks. In
the proposed method, this time includes the time required for applying the distribution procedure in the
parallel structure and the time required for synthesis and applying the move and delete rules. In the parallel
structure, due to the qubit reordering in [12], at most 3n(3n - 1) SWAP gates are used between the input
storing block, the subsets and the output restoring block to order lines.
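
To make the depth estimate concrete, the following Python sketch shows a generic as-soon-as-possible level assignment: each gate is placed in the earliest level in which all of its lines are free. It is an assumption-laden illustration of greedy level compaction in general, not a reproduction of the algorithm of [11]; the function name `greedy_depth` and the encoding of a gate as a tuple of line indices are hypothetical.

```python
def greedy_depth(gates, n_lines):
    """Estimate circuit depth by packing gates into the earliest feasible level.

    gates is a list of tuples; each tuple holds the indices of the lines
    (controls and target) touched by one gate, in circuit order.
    """
    ready = [0] * n_lines            # first level at which each line is free again
    depth = 0
    for lines in gates:
        level = max(ready[q] for q in lines)   # wait for every line the gate uses
        for q in lines:
            ready[q] = level + 1
        depth = max(depth, level + 1)
    return depth


# Two gates on disjoint lines share a level; the third must wait for both.
print(greedy_depth([(0, 1), (2, 3), (1, 2), (0,)], n_lines=4))
```

Because gates are never reordered past a gate that shares a line with them, this estimate respects the original gate order on every line.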
Table 4 Comparison of the proposed approach and prior best results. #A is the number of ancillae. R and P are used for
regular and parallel structures, respectively. The resulting circuits are available at http://ceit.aut.ac.ir/~arabzadeh/results/
and may be viewed with RCViewer+ [19].



                              The Proposed Method       [10]                    Improvement (%)
Benchmark     n  R/P   #A      QC  2-qubit  Depth       QC  2-qubit  Depth      QC  2-qubit  Depth
hwb8          8    R    -    6686     4468   5622     6940     5348   5442     3.6     16.4   -3.3
hwb8          8    P   16    6964     4730   1999     6940     5348   5442    -0.3     11.5   63.2
hwb9          9    R    -   14474    10382  12054    16173    12479  12472    10.5     16.8    3.3
hwb9          9    P   18   15262    10764   4351    16173    12479  12472     5.6     13.7   65.1
hwb10        10    R    -   35298    23584  29751    35618    25453  27812     0.8      7.3   -6.9
hwb10        10    P   20   35890    23874  10062    35618    25453  27812    -0.7      6.2   63.8
hwb11        11    R    -   86864    65260  71418    90745    71175  69763     4.2      8.3   -2.3
hwb11        11    P   22   87234    65442  23866    90745    71175  69763     3.8      8.0   65.7
nth_prime7    7    R    -    2888     2296   2473     3172     2841   2514     8.9     19.1    1.6
nth_prime7    7    P   14    3100     2398   1523     3172     2841   2514     2.2     15.5   39.4
nth_prime8    8    R    -    7016     5624   5852     7618     6622   5793     7.9     15.0   -1.0
nth_prime8    8    P    -       -        -      -     7618     6622   5793       -        -      -
nth_prime9    9    R    -   16820    11907  14285    17975    14076  13941     6.4     15.4   -2.4
nth_prime9    9    P   18   17507    12053  13787    17975    14076  13941     2.6     14.3    1.1
nth_prime10  10    R    -   38843    27743  31924    40301    31841  31254     3.6     12.8   -2.1
nth_prime10  10    P   20   39317    27933  12119    40301    31841  31254     2.4     12.2   61.2
nth_prime11  11    R    -   92863    67401  75668    95433    75474  72934     2.6     10.6   -3.7
nth_prime11  11    P   22   93389    67677  46892    95433    75474  72934     2.1     10.3   35.7
Average            R                                                           5.4     13.6   -1.9
Average            P                                                           2.2     11.5   49.4



Table 5 Comparison of the proposed approach and the one in [10] with the nearest neighbor limitation. The
improvement column compares the results after applying [12] on both methods. The resulting circuits are available at
http://ceit.aut.ac.ir/~arabzadeh/results/ and may be viewed with RCViewer+ [19].

                            The Proposed Method
                            Naive            + [12]           [10] + [12]   Improvement (%)
Benchmark     n  R/P   #A       QC   Depth       QC   Depth       QC   Depth     QC  Depth
hwb8          8    R    -    36684   32313    31553   20940    36732   22720   14.0    7.8
hwb8          8    P   16    46788   14758    36045    9248    36732   22720    1.8   59.2
hwb9          9    R    -    87310   74676    77860   46958    91805   51181   15.1    8.2
hwb9          9    P   18   100228   31810    87389   19597    91805   51181    4.8   61.7
hwb10        10    R    -   279496  248524   202903  112623   228240  117893   11.1    4.4
hwb10        10    P   20   291014   89021   212616   41479   228240  117893    6.8   64.8
hwb11        11    R    -   682182  605294   562817  297986   611843  307114    8.0    2.9
hwb11        11    P   22   685944  205472   569876  104372   611843  307114    6.8   66.0
nth_prime7    7    R    -    12264   10649    10922    9799    15356   10130   28.8    3.2
nth_prime7    7    P   14    15106    7734    15897    6930    15356   10130   -3.5   31.5
nth_prime8    8    R    -    35976   29975    30796   26920    42059   24574   26.7   -9.5
nth_prime8    8    P    -        -       -        -       -    42059   24574      -      -
nth_prime9    9    R    -    91984   76910    90511       -    99003   55737    8.5    2.2
nth_prime9    9    P   18    98686   76020    95362   54850    99003   55737    3.6    1.5
nth_prime10  10    R    -   241538  199996   222865  124122   248901  137091   10.4    9.4
nth_prime10  10    P   20   250526   79165   228777   49613   248901  137091    8.0   63.8
nth_prime11  11    R    -   654910  577721   576047  308413   625320  324005    7.8    4.8
nth_prime11  11    P   22   665132  361756   585165  195500   625320  324005    6.4   39.6
Average            R                                                           14.6    3.8
Average            P                                                            4.4   48.6



As can be seen in Table 5, the effect of the post-process method is more significant for [10], but altogether
the results of the proposed LNN-based method are better than those of [10] after applying [12] on
both methods. Note that using negative controls does not increase the quantum cost. For odd
permutations, one more ancilla should be added. The two-qubit costs are compared in Table 4 and the
results show 13.6% and 11.5% improvement on average for the regular and parallel structures, respectively.
In the parallel structure, the average depth improvement of the N-th prime benchmarks is less than that of
the hwbN functions since the input cycles of the N-th prime functions are unstructured with different cycle
lengths, which results in unbalanced subsets after distribution. Input cycle distributions after decomposition
(DCM) and distribution (DIST) are reported in Table 3. For hwbN functions, applying the distribution method
leads to 3 sets with almost the same numbers of elementary cycles. We report the circuit depth for each set along
with the total depth after considering the effect of input storing and output restoring blocks in this table. As
reported in Table 3, function nth_prime8 has one disjoint input cycle. Accordingly, the resulting elementary
cycles should be assigned to one set by the proposed method.
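
The improvement percentages in Table 4 and Table 5 appear to follow the usual relative-reduction formula, (prior - proposed) / prior. The Python sketch below states that assumption explicitly and checks it against the hwb8 entries of Table 4; the function name `improvement` is illustrative, and the last printed digit in the tables may differ slightly from these values depending on how the original figures were rounded.

```python
def improvement(prior, proposed):
    """Relative reduction of a metric, in percent (negative means the metric got worse)."""
    return 100.0 * (prior - proposed) / prior


# hwb8 in Table 4: depth of [10] is 5442; the proposed method gives 5622 (R) and 1999 (P).
print(round(improvement(5442, 5622), 2))   # regular structure, depth
print(round(improvement(5442, 1999), 2))   # parallel structure, depth
print(round(improvement(5348, 4468), 2))   # regular structure, two-qubit cost
```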

In choosing the benchmark functions that were considered in this paper, the general guidelines presented
in [10] and [3] were followed. These guidelines stipulate that one of the scenarios in which the cycle-based
methods render significantly superior results is when the input function contains permutations without
regular patterns, such as the hwbN and N-th prime functions [10]. For this reason, only the results of these
functions are reported in this paper. As for other functions in [18], some are reported in [10] along with a
discussion on their suitability for the cycle-based approach (like Permanent). To avoid being repetitive, we did
not include this set in this paper. There are yet other benchmarks that include important arithmetic functions
like adders, multipliers and group arithmetic (e.g., in Galois Fields). Since the proposed cycle-based synthesis
method is a general synthesis approach, it may not produce interesting results compared to other approaches
specifically developed for those benchmark functions.

7 Conclusion 

In this paper, a synthesis approach is proposed to reduce logical depth for architectures with limited
interactions; it applies a cycle-based approach to synthesize a given specification. The proposed method
considers the interaction cost and depth in addition to the traditional quantum cost metric, taking a
multi-objective view. To achieve this, we redesigned the elementary cycles in [10] with negative controls
and limited interaction between gate lines. Moreover, a new parallel circuit structure was proposed for
reversible logic in the presence of several ancilla registers. Altogether, the proposed structure, which can
also be used with other synthesis methods, combined with the proposed cycle-based synthesis method for
interaction cost, forms our complete flow for depth-optimized reversible circuit synthesis.

A given permutation is written as a set of disjoint cycles to be used in the proposed parallel circuit
structure. Then, the resulting cycles are distributed among the available n-line registers based on the bin
packing problem. The cycles are then synthesized on the assigned registers independently. Our experiments
and analysis show the effectiveness of the proposed approach with and without the interaction cost
limitations for the attempted benchmarks and in the worst case.